Load Required Packages

The pacman package provides a convenient way to load packages: its p_load function installs a package first if it is not already installed, then loads it. One of my favorite ggplot2 themes is theme_pubclean from the ggpubr package, so I set it here as the default theme for all plots.

#install.packages("ggpubr")

library(pacman)

pacman::p_load(tidyverse,janitor,DataExplorer,skimr,ggpubr,viridis)


theme_set(theme_pubclean())
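
For readers who do not have pacman, the install-if-missing behaviour described above can be approximated in base R. This is only a sketch: load_or_install is a hypothetical helper name, not a pacman function, and it assumes a CRAN mirror is configured.

```r
# Hypothetical helper (not part of pacman): install a package only when it is
# missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # assumes a CRAN mirror is configured
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# These packages ship with R, so nothing needs to be installed here.
loaded <- load_or_install(c("stats", "utils"))
```

In practice pacman::p_load remains the shorter option; the sketch only makes the two steps (check, then attach) explicit.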

Introduction

In this post, I would like to go through some common methods of data exploration. Data exploration is one of the first analyses performed before any model-building task. It can uncover hidden patterns and lead to insights into the phenomenon behind the data. It can inform the selection of appropriate statistical techniques, tools and models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the causes of the observed phenomena in the data. We can also detect outliers and anomalies in the data through exploration. Exploratory analysis emphasizes graphical visualizations of the data.

The data for this analysis, the Orange Juice data, is contained in the ISLR package. The ISLR package was created to store the data for the popular introductory statistical learning text, An Introduction to Statistical Learning with Applications in R (Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani). The data contains 1070 purchases in which the customer bought either Citrus Hill or Minute Maid orange juice, together with a number of characteristics of the customer and the product. The categorical response variable is Purchase, with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid. The goal is to predict which of the two brands a customer buys, based on seventeen features that describe the product and the customer. The dataset can be downloaded here. It contains 1070 observations and seventeen features plus the response variable Purchase.

Description of Variables:

  1. WeekofPurchase: Week of purchase
  2. StoreID: Store ID
  3. PriceCH: Price charged for CH
  4. PriceMM: Price charged for MM
  5. DiscCH: Discount offered for CH
  6. DiscMM: Discount offered for MM
  7. SpecialCH: Indicator of special on CH
  8. SpecialMM: Indicator of special on MM
  9. LoyalCH: Customer brand loyalty for CH
  10. SalePriceMM: Sale price for MM
  11. SalePriceCH: Sale price for CH
  12. PriceDiff: Sale price of MM less sale price of CH
  13. Store7: A factor with levels No and Yes indicating whether the sale is at Store 7
  14. PctDiscMM: Percentage discount for MM
  15. PctDiscCH: Percentage discount for CH
  16. ListPriceDiff: List price of MM less list price of CH
  17. STORE: store id.
# Import dataset
orangejuice<-read_csv('https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/ExploratoryDataAnalysis/master/orangejuice.csv')

write_csv(orangejuice,"orangejuice.csv")

orangejuice%>%head()
## # A tibble: 6 x 18
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
##   <chr>             <int>   <int>   <dbl>   <dbl>  <dbl>  <dbl>     <int>
## 1 CH                  237       1    1.75    1.99   0       0           0
## 2 CH                  239       1    1.75    1.99   0       0.3         0
## 3 CH                  245       1    1.86    2.09   0.17    0           0
## 4 MM                  227       1    1.69    1.69   0       0           0
## 5 CH                  228       7    1.69    1.69   0       0           0
## 6 CH                  230       7    1.69    1.99   0       0           0
## # ... with 10 more variables: SpecialMM <int>, LoyalCH <dbl>,
## #   SalePriceMM <dbl>, SalePriceCH <dbl>, PriceDiff <dbl>, Store7 <chr>,
## #   PctDiscMM <dbl>, PctDiscCH <dbl>, ListPriceDiff <dbl>, STORE <int>

Univariate Analysis

plot_str(orangejuice)

There are 40 missing values in the data set. In this exploratory analysis we simply delete the rows that contain them; imputing missing values will be discussed extensively in a later post. When the number of missing values is small relative to the sample size, deleting the affected observations is a basic approach to handling missing data.
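
As a minimal sketch of counting and then dropping missing values, the base R calls below work on a small made-up data frame. The toy values are assumptions for illustration and are not taken from the orange juice data.

```r
# Toy data frame standing in for orangejuice (values are made up).
toy <- data.frame(
  PriceCH  = c(1.75, NA, 1.86, 1.69),
  PriceMM  = c(1.99, 1.99, NA, 1.69),
  Purchase = c("CH", "CH", "MM", "MM")
)

missing_per_column <- colSums(is.na(toy))  # NA count for each variable
total_missing <- sum(is.na(toy))           # NA count over the whole table
toy_complete <- na.omit(toy)               # keep only rows without any NA

missing_per_column
total_missing
nrow(toy_complete)
```

Note that na.omit drops a whole row as soon as one of its cells is missing, so the number of rows lost can be smaller than the number of missing cells when several fall in the same row.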

plot_missing(orangejuice)
Fig. 30

An alternative visualization approach is to use the visna function from the extracat package. The columns represent the variables in the data and the rows the missing-value patterns. Blue cells mark the variables that are missing in a given pattern. The bars beneath the columns show the proportion of missing values for each variable, and the bars on the right show the relative frequencies of the patterns.

pacman::p_load(extracat)


extracat::visna(orangejuice, sort = "b", sort.method="optile", fr=100, pmax=0.05, s = 2)
Fig. 30

plot_histogram(orangejuice)
Fig. 30

plot_density(orangejuice)
Fig. 30

plot_bar(orangejuice)
Fig. 30

Fewer purchases were made at store 7 than at the other stores, and more customers purchased Citrus Hill than Minute Maid orange juice.

Multivariate Analysis

Multivariate analysis includes examining the correlation structure between the variables in the dataset, as well as the bivariate relationship between the response variable and each predictor variable.

pacman::p_load(GGally)

na.omit(orangejuice)%>%select_if(is.double)%>%ggpairs(  title = "Continuous Variables")
Fig. 30

Multiple continuous variables can be visualized with a Parallel Coordinate Plot (PCP). Each vertical axis represents a column variable in the data, and each observation is drawn as a line connecting its values on the corresponding vertical axes. The GGally package, a ggplot2 extension, provides the ggparcoord function for drawing PCPs in R. High values of WeekofPurchase correspond to stores with low ID numbers, and low values of SpecialMM (the indicator of a special on MM) correspond to higher customer loyalty.
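
Because the variables sit on very different scales, a PCP normally rescales each axis first. The GGally documentation describes the "uniminmax" option as scaling each variable so that its minimum is zero and its maximum is one. A minimal base R sketch of that idea follows; rescale_minmax is a hypothetical helper name and the toy values are made up.

```r
# Hypothetical helper: map the minimum of x to 0 and the maximum to 1.
# A constant column would divide by zero, so it is not handled here.
rescale_minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy columns on very different scales (values are made up).
toy <- data.frame(
  WeekofPurchase = c(227, 240, 257, 278),
  PriceCH        = c(1.69, 1.79, 1.86, 2.09)
)

# Rescale every column independently, as a PCP axis would be.
scaled <- as.data.frame(lapply(toy, rescale_minmax))
scaled
```

After rescaling, every axis spans the same vertical range, so line crossings between neighbouring axes reflect rank changes rather than differences in units.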

p_ <- GGally::print_if_interactive

# color the lines by the response variable Purchase
p <- ggparcoord(
  data = na.omit(orangejuice), columns = 2:10, groupColumn = "Purchase",
  title = "Parallel Coord. Plot of Orange Juice Data", scale = "uniminmax",
  boxplot = FALSE, mapping = ggplot2::aes(size = 1),
  showPoints = TRUE, alphaLines = .05
) +
  # scale_fill_viridis(discrete = TRUE) +
  scale_fill_manual(values = c("#B9DE28FF", "#D1E11CFF")) +
  ggplot2::scale_size_identity()
p_(p)
na.omit(orangejuice)%>%select_if(is.double)%>%
  mutate(Above_Avg = PriceCH > mean(PriceCH)) %>%
  GGally::ggparcoord(showPoints = TRUE,
 
    alpha = .05,
    scale = "center",
    columns = 1:8,
    groupColumn = "Above_Avg"
    )
Fig. 30

Correlation between numeric variables can also be visualized with a heatmap, which can reveal clusters of strongly correlated variables. We compute the correlation matrix and visualize it with a heatmap. Dark cells represent low correlations, whereas light yellow cells represent strong correlations. There are strong correlations among variable pairs such as (WeekofPurchase, Price) and (PctDisc, SalePrice) for both CH and MM, as well as (ListPriceDiff, PriceMM).

plot_correlation(na.omit(orangejuice),type = "continuous",theme_config = list(legend.position = "bottom", axis.text.x =
  element_text(angle = 90)))
Fig. 30
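
The numbers behind such a heatmap come from an ordinary correlation matrix. As a small sketch on made-up values (the toy data frame is an assumption, not the real data), base R's cor can either be given complete rows or be told to drop incomplete rows itself:

```r
# Toy numeric columns with one missing value (values are made up).
toy <- data.frame(
  PriceMM     = c(1.69, 1.99, 2.09, 2.18, NA),
  SalePriceMM = c(1.69, 1.69, 2.09, 2.13, 2.29),
  DiscMM      = c(0, 0.30, 0, 0.05, 0)
)

# Drop incomplete rows first, then compute Pearson correlations.
cor_matrix <- cor(na.omit(toy))

# Same result: let cor() discard the incomplete rows itself.
cor_complete <- cor(toy, use = "complete.obs")

round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, which is why heatmaps of it often show only one triangle.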

The corrplot function can equivalently plot the correlation between variables in a dataset, as shown below:

pacman::p_load(plotly,corrr,RColorBrewer,corrplot)



na.omit(orangejuice)%>%select_if(is.numeric)%>%cor()%>%corrplot::corrplot()
Fig. 30

#Equivalently
#na.omit(orangejuice)%>%select_if(is.numeric)%>%cor()%>%
#  corrplot.mixed(upper = "color", tl.col = "black")
na.omit(orangejuice)%>%
  select_if(is.numeric) %>%
  cor() %>%
  heatmap(Rowv = NA, Colv = NA, scale = "column")
Fig. 30

An interactive heatmap can be easily plotted courtesy of the d3heatmap package.

pacman::p_load(d3heatmap)

na.omit(orangejuice)%>%
  select_if(is.numeric) %>%
  cor() %>%
d3heatmap(colors = "Blues", scale = "col",
          dendrogram = "row", k_row = 3)

Fig. 30

library(devtools)

# easyGgplot2 is not on CRAN; install it from GitHub if needed:
# install_github("kassambara/easyGgplot2")


pacman::p_load(ggalt,gridExtra,scales,easyGgplot2)


p1<-ggplot(orangejuice, aes(x=SalePriceCH, fill=Purchase)) + geom_bkde(alpha=0.5)
p2<-ggplot(orangejuice, aes(x=SalePriceMM, fill=Purchase)) + geom_bkde(alpha=0.5)



# Multiple graphs on the same page
easyGgplot2::ggplot2.multiplot(p1,p2, cols=2)
Fig. 30

The sale price distributions for both Citrus Hill and Minute Maid orange juice are multimodal, and Citrus Hill has a higher sale price.

skimr::skim(orangejuice)%>%kable()
## Skim summary statistics  
##  n obs: 1070    
##  n variables: 18    
## 
## Variable type: character
## 
##  variable    missing    complete     n      min    max    empty    n_unique 
## ----------  ---------  ----------  ------  -----  -----  -------  ----------
##  Purchase       0         1070      1070     2      2       0         2     
##   Store7        0         1070      1070     2      3       0         2     
## 
## Variable type: integer
## 
##     variable       missing    complete     n       mean      sd      p0     p25    p50    p75    p100      hist   
## ----------------  ---------  ----------  ------  --------  -------  -----  -----  -----  -----  ------  ----------
##    SpecialCH          2         1068      1070     0.15     0.35      0      0      0      0      1      ▇▁▁▁▁▁▁▂ 
##    SpecialMM          5         1065      1070     0.16     0.37      0      0      0      0      1      ▇▁▁▁▁▁▁▂ 
##      STORE            2         1068      1070     1.63     1.43      0      0      2      3      4      ▇▃▁▅▁▅▁▃ 
##     StoreID           1         1069      1070     3.96     2.31      1      2      3      7      7      ▃▅▅▃▁▁▁▇ 
##  WeekofPurchase       0         1070      1070    254.38    15.56    227    240    257    268    278     ▆▅▅▃▅▇▆▇ 
## 
## Variable type: numeric
## 
##    variable       missing    complete     n      mean      sd        p0       p25     p50     p75     p100      hist   
## ---------------  ---------  ----------  ------  -------  -------  ---------  ------  ------  ------  ------  ----------
##     DiscCH           2         1068      1070    0.052    0.12        0        0       0       0      0.5     ▇▁▁▁▁▁▁▁ 
##     DiscMM           4         1066      1070    0.12     0.21        0        0       0      0.23    0.8     ▇▁▁▂▁▁▁▁ 
##  ListPriceDiff       0         1070      1070    0.22     0.11        0       0.14    0.24    0.3     0.44    ▂▂▂▂▇▆▁▁ 
##     LoyalCH          5         1065      1070    0.57     0.31     1.1e-05    0.32    0.6     0.85     1      ▅▂▃▃▆▃▃▇ 
##    PctDiscCH         2         1068      1070    0.027    0.062       0        0       0       0      0.25    ▇▁▁▁▁▁▁▁ 
##    PctDiscMM         5         1065      1070    0.059     0.1        0        0       0      0.11    0.4     ▇▁▁▂▁▁▁▁ 
##     PriceCH          1         1069      1070    1.87      0.1      1.69      1.79    1.86    1.99    2.09    ▂▅▁▇▁▁▅▁ 
##    PriceDiff         1         1069      1070    0.15     0.27      -0.67      0      0.23    0.32    0.64    ▁▁▂▂▃▇▃▂ 
##     PriceMM          4         1066      1070    2.09     0.13      1.69      1.99    2.09    2.18    2.29    ▁▁▁▃▁▇▃▂ 
##   SalePriceCH        1         1069      1070    1.82     0.14      1.39      1.75    1.86    1.89    2.09    ▁▁▁▂▆▇▅▁ 
##   SalePriceMM        5         1065      1070    1.96     0.25      1.19      1.69    2.09    2.13    2.29    ▁▁▃▃▁▂▇▆

The skimr and mlr packages have functions that conveniently summarize a dataset and present the output in tabular form.

skimmed <-skim_to_wide(orangejuice)
skimmed
## # A tibble: 18 x 17
##    type   variable missing complete n     min   max   empty n_unique mean 
##    <chr>  <chr>    <chr>   <chr>    <chr> <chr> <chr> <chr> <chr>    <chr>
##  1 chara… Purchase 0       1070     1070  2     2     0     2        <NA> 
##  2 chara… Store7   0       1070     1070  2     3     0     2        <NA> 
##  3 integ… Special… 2       1068     1070  <NA>  <NA>  <NA>  <NA>     "  0…
##  4 integ… Special… 5       1065     1070  <NA>  <NA>  <NA>  <NA>     "  0…
##  5 integ… STORE    2       1068     1070  <NA>  <NA>  <NA>  <NA>     "  1…
##  6 integ… StoreID  1       1069     1070  <NA>  <NA>  <NA>  <NA>     "  3…
##  7 integ… WeekofP… 0       1070     1070  <NA>  <NA>  <NA>  <NA>     254.…
##  8 numer… DiscCH   2       1068     1070  <NA>  <NA>  <NA>  <NA>     0.052
##  9 numer… DiscMM   4       1066     1070  <NA>  <NA>  <NA>  <NA>     "0.1…
## 10 numer… ListPri… 0       1070     1070  <NA>  <NA>  <NA>  <NA>     "0.2…
## 11 numer… LoyalCH  5       1065     1070  <NA>  <NA>  <NA>  <NA>     "0.5…
## 12 numer… PctDisc… 2       1068     1070  <NA>  <NA>  <NA>  <NA>     0.027
## 13 numer… PctDisc… 5       1065     1070  <NA>  <NA>  <NA>  <NA>     0.059
## 14 numer… PriceCH  1       1069     1070  <NA>  <NA>  <NA>  <NA>     "1.8…
## 15 numer… PriceDi… 1       1069     1070  <NA>  <NA>  <NA>  <NA>     "0.1…
## 16 numer… PriceMM  4       1066     1070  <NA>  <NA>  <NA>  <NA>     "2.0…
## 17 numer… SalePri… 1       1069     1070  <NA>  <NA>  <NA>  <NA>     "1.8…
## 18 numer… SalePri… 5       1065     1070  <NA>  <NA>  <NA>  <NA>     "1.9…
## # ... with 7 more variables: sd <chr>, p0 <chr>, p25 <chr>, p50 <chr>,
## #   p75 <chr>, p100 <chr>, hist <chr>
mlr::summarizeColumns(orangejuice)
##              name      type na         mean        disp median        mad
## 1        Purchase character  0           NA  0.38971963     NA         NA
## 2  WeekofPurchase   integer  0 254.38130841 15.55828614 257.00 20.7564000
## 3         StoreID   integer  1   3.95696913  2.30818860   3.00  1.4826000
## 4         PriceCH   numeric  1   1.86742750  0.10201723   1.86  0.1482600
## 5         PriceMM   numeric  4   2.08503752  0.13442854   2.09  0.1334340
## 6          DiscCH   numeric  2   0.05195693  0.11756276   0.00  0.0000000
## 7          DiscMM   numeric  4   0.12341463  0.21412552   0.00  0.0000000
## 8       SpecialCH   integer  2   0.14700375  0.35427555   0.00  0.0000000
## 9       SpecialMM   integer  5   0.16244131  0.36902846   0.00  0.0000000
## 10        LoyalCH   numeric  5   0.56520304  0.30807037   0.60  0.3891084
## 11    SalePriceMM   numeric  5   1.96193427  0.25250999   2.09  0.1482600
## 12    SalePriceCH   numeric  1   1.81551918  0.14344425   1.86  0.1482600
## 13      PriceDiff   numeric  1   0.14632367  0.27163786   0.23  0.1482600
## 14         Store7 character  0           NA  0.33271028     NA         NA
## 15      PctDiscMM   numeric  5   0.05938809  0.10184138   0.00  0.0000000
## 16      PctDiscCH   numeric  2   0.02731794  0.06228113   0.00  0.0000000
## 17  ListPriceDiff   numeric  0   0.21799065  0.10753545   0.24  0.0889560
## 18          STORE   integer  2   1.62827715  1.43049727   2.00  1.4826000
##          min        max nlevs
## 1   4.17e+02 653.000000     2
## 2   2.27e+02 278.000000     0
## 3   1.00e+00   7.000000     0
## 4   1.69e+00   2.090000     0
## 5   1.69e+00   2.290000     0
## 6   0.00e+00   0.500000     0
## 7   0.00e+00   0.800000     0
## 8   0.00e+00   1.000000     0
## 9   0.00e+00   1.000000     0
## 10  1.10e-05   0.999947     0
## 11  1.19e+00   2.290000     0
## 12  1.39e+00   2.090000     0
## 13 -6.70e-01   0.640000     0
## 14  3.56e+02 714.000000     2
## 15  0.00e+00   0.402010     0
## 16  0.00e+00   0.252688     0
## 17  0.00e+00   0.440000     0
## 18  0.00e+00   4.000000     0
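
To make the columns of these summary tables less opaque, here is a minimal base R sketch that computes a few of the same statistics (missing count, mean, standard deviation, median) per column. The helper summarise_column and the toy values are assumptions for illustration; this is not how skimr or mlr are implemented.

```r
# Toy numeric columns with one missing value (values are made up).
toy <- data.frame(
  LoyalCH   = c(0.50, 0.60, NA, 0.85),
  PriceDiff = c(0.24, -0.30, 0.23, 0.00)
)

# Hypothetical helper: a few per-column statistics, ignoring NAs.
summarise_column <- function(x) {
  c(
    missing = sum(is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE)
  )
}

# One row per variable, one column per statistic.
summary_table <- t(sapply(toy, summarise_column))
summary_table
```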
(spec_variables <- attr(orangejuice, "spec"))
## cols(
##   Purchase = col_character(),
##   WeekofPurchase = col_integer(),
##   StoreID = col_integer(),
##   PriceCH = col_double(),
##   PriceMM = col_double(),
##   DiscCH = col_double(),
##   DiscMM = col_double(),
##   SpecialCH = col_integer(),
##   SpecialMM = col_integer(),
##   LoyalCH = col_double(),
##   SalePriceMM = col_double(),
##   SalePriceCH = col_double(),
##   PriceDiff = col_double(),
##   Store7 = col_character(),
##   PctDiscMM = col_double(),
##   PctDiscCH = col_double(),
##   ListPriceDiff = col_double(),
##   STORE = col_integer()
## )
spec_variables<-c("LoyalCH", "SalePriceMM","SalePriceCH" ,"PctDiscMM","PctDiscCH","ListPriceDiff","Purchase","Store7")


spec_variable<-noquote(spec_variables)
 
pm<-ggpairs(orangejuice,spec_variable , title = "",mapping = aes(color = Purchase))+
  theme(legend.position = "top")

pm
Fig. 30

We can select one of the plots above as follows:

pm[1,7]
Fig. 30

na.omit(orangejuice)%>% select_if(~!is.double(.x))%>%
  ggpairs( mapping = aes(color = Purchase) , title = "Categorical Variables")+
  theme(legend.position = "top")
Fig. 30
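
The split used above treats every column that is not stored as a double as categorical, which means integer columns such as StoreID end up in the categorical group. A small base R sketch of the same split on made-up data (the toy data frame is an assumption):

```r
# Toy data frame mixing character, integer and double columns (made up).
toy <- data.frame(
  Purchase = c("CH", "MM", "CH"),
  StoreID  = c(1L, 7L, 2L),
  PriceCH  = c(1.75, 1.69, 1.86),
  LoyalCH  = c(0.50, 0.60, 0.68),
  stringsAsFactors = FALSE
)

# TRUE for columns stored as double, FALSE otherwise.
is_dbl <- vapply(toy, is.double, logical(1))

continuous_names  <- names(toy)[is_dbl]
categorical_names <- names(toy)[!is_dbl]

continuous_names
categorical_names
```

If integer columns should instead count as continuous, is.numeric can be used in place of is.double.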

#Equivalently

#na.omit(orangejuice)%>% select_if(funs(!is.double(.)))%>%
 # ggpairs(  title = "Categorical Variables")


#index=!sapply(na.omit(orangejuice), is.double)
#orange_numeric<-orangejuice[index==TRUE]
#orange_numeric%>%ggpairs(  title = "Categorical Variables")



#na.omit(orangejuice)%>%select_if(negate(is.double))%>%
#  ggpairs(  title = "Categorical Variables")
categorical_orange=na.omit(orangejuice)%>% select_if(~!is.double(.x))
continuous_orange=na.omit(orangejuice)%>% select_if(is.double)

categorical_orange<-noquote(names(categorical_orange))
continuous_orange<-noquote(names(continuous_orange))


ggduo(
  orangejuice, rev(continuous_orange), categorical_orange,
  mapping = aes(color = Purchase),
   types = list(continuous = wrap("smooth_loess", alpha = 0.25)),
  showStrips = FALSE,
  title = "Variable Comparison By Purchase",
  xlab = "Continuous Variables",
  ylab = "Categorical",
  legend = c(5,2)
) +
  theme(legend.position = "top")
Fig. 30
